PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326

Conversation
Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?

---
Thanks for your kind comment.

---
Nice work! However, will offloading MatMul to the cDSP layer by layer incur significant FastRPC overhead? As far as I know, with zero-copy enabled, the overhead of a single FastRPC call is around 1–3 ms (on an 8750 device). Could you please share the end-to-end performance results?

---
1. This can be avoided by the same mechanism used in the QNN SDK: a shared buffer or memory pool between the AP and the cDSP through ION memory or DMA memory.
2. What do you mean by "end-to-end performance results"?

You can modify inference_approach manually to verify the approach through QNN or cDSP:
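As a rough illustration of the shared-buffer mechanism mentioned in item 1 above, here is a minimal sketch using the Hexagon SDK's rpcmem API (this is not code from this PR, and buffer-registration details vary across SDK versions):

```cpp
// Allocate an ION-backed buffer that the FastRPC framework can map into
// the cDSP address space, so tensor data crosses AP<->cDSP without a copy.
#include "rpcmem.h" // from the Hexagon SDK

static void * alloc_shared_tensor_buffer(int nbytes) {
    // RPCMEM_HEAP_ID_SYSTEM with default flags yields a shareable ION buffer
    // that FastRPC recognizes; pass the returned pointer in the remote call.
    void * buf = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, nbytes);
    return buf; // release with rpcmem_free(buf) when the tensor is done
}
```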
---
Thanks for your feedback!

I understand that using techniques like shared buffers will reduce FastRPC overhead (as I mentioned, "with zero-copy enabled"). However, LLM inference requires offloading hundreds of layers; even with shared buffers, this overhead remains significant. So what I'd like to ask is whether you have any other optimization strategies to address this issue. I see that you have actually analyzed and found it infeasible to transfer the entire graph to QNN. However, if offloading is done on a per-layer basis, it seems the overhead of these FastRPC calls can never be optimized away.

I'm sorry for the confusing question. "End-to-end performance results" means the prefill & decode speed (tokens/s). I see your comments mention some results, but these results seem incomplete.

Thank you for providing these running commands; I'll try them. Additionally, I would like to offer an extra opinion on why performance on Qualcomm's official AI Hub is good enough: I think it's because they convert the entire graph to QNN, which gives them three exclusive advantages in hardware utilization:
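To put rough numbers on this concern (a back-of-the-envelope estimate with assumed values, not a measurement from this PR): a ~1.8B model such as qwen1_5-1_8b has roughly 24 transformer layers with around 7 offloaded mat-muls each, so one decode step issues on the order of:

```
24 layers x ~7 mat-muls  ≈ 170 FastRPC calls per token
170 calls x (1–3 ms)     ≈ 170–510 ms of RPC overhead per token
                         → a ceiling of roughly 2–6 tokens/s from RPC alone
```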
---
By the way, I took a quick look at the operator code you implemented on the cDSP, and it seems that you haven't used HVX intrinsics yet. This might be one of the reasons for the suboptimal performance. I suggest you take a look at the linear algebra library provided by Qualcomm for the cDSP, although its performance is not great either.
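For reference, a minimal sketch of what an HVX-vectorized fp32 add kernel on the cDSP could look like (assumptions: HVX 128-byte vector mode, v68+ qf32 intrinsics from the Hexagon SDK, and 128-byte-aligned buffers; this is not code from the PR):

```cpp
#include <hexagon_types.h>
#include <hexagon_protos.h>

// dst = a + b, processing 32 floats (one 128-byte HVX vector) per iteration.
void ggmldsp_add_f32_hvx(float * dst, const float * a, const float * b, int n) {
    const int step = 128 / sizeof(float); // 32 floats per HVX vector
    int i = 0;
    for (; i + step <= n; i += step) {
        HVX_Vector va = *(const HVX_Vector *)(a + i);
        HVX_Vector vb = *(const HVX_Vector *)(b + i);
        HVX_Vector vs = Q6_Vqf32_vadd_VsfVsf(va, vb);       // IEEE fp32 in, qf32 out
        *(HVX_Vector *)(dst + i) = Q6_Vsf_equals_Vqf32(vs); // convert back to fp32
    }
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i]; // scalar tail
    }
}
```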
---

The overhead of FastRPC should be the same across the various tech approaches; I personally think the overhead of going through the cDSP directly might be the minimum.

datapath through QNN:

datapath through cDSP directly:
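A rough reconstruction of those two datapaths, based on the descriptions in this thread (the PR's original figures are not reproduced here):

```
datapath through QNN:
  ggml op -> ggml-hexagon (HWACCEL_QNN) -> QNN SDK API (ARM-AP side)
          -> libQnnHtp*.so -> FastRPC -> Qualcomm Hexagon nn libs on cDSP

datapath through cDSP directly:
  ggml op -> ggml-hexagon (HWACCEL_CDSP) -> FastRPC
          -> hexagon kernels (tiny ggmldsp) on cDSP
```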
I think that's why the senior staff tech expert from Qualcomm headquarters said "QNN is not the right solution here". In fact, the NPU performance through QNN is really bad here (ggml/llama.cpp), because we can't utilize the dedicated binary tools provided by Qualcomm, and we (the llama.cpp community) can't re-create Qualcomm's entire dedicated AI stack in ggml/llama.cpp (mapping all ggml ops and the entire ggml cgraph to a single QNN graph would not be easy even for Qualcomm's world-class engineering team; if I were a regular employee on Qualcomm's AI team, I'd prefer to follow the adaptation approach of Intel's SYCL stack, which is a more practical and desirable direction). Accordingly, offloading some performance-sensitive ggml ops to the cDSP directly is a practical direction: we only need to focus on hexagon kernels, implemented with carefully designed algorithms or HVX instructions on the cDSP. This is also the key reason why I think this PR should be approved (we don't need to do more complex things in ggml-qnn.cpp from now on). At the same time, we can clearly see that the so-called FastRPC mechanism or framework is quite similar to the mechanism in a TEE.

This is because the QNN SDK will indirectly call Qualcomm's Hexagon nn libs on the cDSP, which might be (and should be) highly optimized with HVX SIMD instructions (they have plenty of excellent software engineers and AI experts).
I agree with your opinion: QNN's internals do some special handling for the single QNN graph you pointed out. Please refer to the section "Big picture of ggml-hexagon backend" in this PR, or to jeffzhou2000#24.

---
Can the Hexagon SDK not be used in a Termux environment?

---
The Hexagon SDK needs to be used in a standard Linux environment (Ubuntu 20.04/22.04 is recommended) to build the entire ggml-hexagon source code in this PR; please refer to the section "How to build ggml-hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone" in the PR description. Btw, I'm not familiar with the Termux environment, but I think it's a limited Linux environment running on an Android device. In fact, we can verify/validate this PR through the self-made script build-run-android.sh on a standard Linux OS (or through a Linux VM on Windows 10/11, or through WSL2 on Windows 10/11) easily and directly, so Termux might not be needed or suitable here (I'm not sure about this because I didn't verify this opinion).

Currently, a Qualcomm account is required to get the Hexagon SDK and build the entire ggml-hexagon source code; this is the key reason why build-run-android.sh can't download the Hexagon SDK automatically (the self-made script can download the Android NDK and QNN SDK automatically to keep the workflow easy). At the same time, I cannot share my local self-made zip of the Hexagon SDK publicly, or distribute the Hexagon SDK in any other way, because the Hexagon SDK must be obtained with a Qualcomm Developer Account and my local zip contains/binds my personal Qualcomm account and license information. This IPR policy makes sense (my personal opinion): Qualcomm has already published its QNN SDK freely and without restriction (we developers and programmers must thank @slaren), and it is clear that the Hexagon SDK is more valuable; although the Hexagon SDK is also free to developers, Qualcomm needs to know how many developers use their unique and valuable Hexagon SDK. Of course, it would make developers' workflow easier if Qualcomm also published an unrestricted Hexagon SDK, because I also think "more people/companies using the world-class Snapdragon mobile/vehicle/desktop SoC" is another key point.

---
@zhouwg Thank you for your answer. It seems that the only way is to unzip it using Google Colab.

---
I would like to specify the Termux lib path, but it is fixed and I would like to change it. In ggml-hexagon.cpp: `.runtimelib_path = "/data/local/tmp/"` (QNN_DEFAULT_LIB_SEARCH_PATH).

---
Thanks for your comment. What's your suggestion for this problem?

---
Is it not enough to just replace it with QNN_DEFAULT_LIB_SEARCH_PATH?

---
When I copied what I built on Google Colab to Termux and ran it, I got the following error. Device: GT5 Pro (Snapdragon 8 Gen 3).

```
[bin]$ ls $PREFIX/lib/libQnn
libQnnCpu.so
libQnnGpu.so
libQnnHtp.so
libQnnHtpNetRunExtensions.so
libQnnHtpOptraceProfilingReader.so
libQnnHtpPrepare.so
libQnnHtpProfilingReader.so
libQnnHtpV68CalculatorStub.so
libQnnHtpV68Skel.so
libQnnHtpV68Stub.so
libQnnHtpV69CalculatorStub.so
libQnnHtpV69Skel.so
libQnnHtpV69Stub.so
libQnnHtpV73CalculatorStub.so
libQnnHtpV73Skel.so
libQnnHtpV73Stub.so
libQnnHtpV75CalculatorStub.so
libQnnHtpV75Skel.so
libQnnHtpV75Stub.so
libQnnHtpV79CalculatorStub.so
libQnnHtpV79Skel.so
libQnnHtpV79Stub.so
libQnnSystem.so
[bin]$ LD_LIBRARY_PATH=".:/vendor/lib64" ./llama-server
[ggmlhexagon_load_cfg, 1756]: load hexagon appcfg from /data/data/com.termux/files/usr/lib/ggml-hexagon.cfg
[operator(), 1762]: section[cdsp    ],[enable_rpc_dma_mempool ] = [0]
[operator(), 1762]: section[cdsp    ],[enable_rpc_ion_mempool ] = [0]
[operator(), 1762]: section[qnn     ],[precision_mode ] = ["fp16"]
[operator(), 1762]: section[qnn     ],[enable_dlbc ] = [1]
[operator(), 1762]: section[qnn     ],[vtcm_size_in_mb ] = [8]
[operator(), 1762]: section[qnn     ],[hvx_threads ] = [4]
[operator(), 1762]: section[general ],[hwaccel_approach ] = [2]
[operator(), 1762]: section[general ],[print_qnn_internal_log ] = [0]
[operator(), 1762]: section[general ],[enable_q_mulmat ] = [0]
[operator(), 1762]: section[general ],[print_tensors_info ] = [0]
[operator(), 1762]: section[general ],[hexagon_backend ] = [2]
[operator(), 1762]: section[general ],[dump_op_info ] = [0]
[operator(), 1762]: section[general ],[enable_perf ] = [1]
[operator(), 1762]: section[general ],[version ] = ["1.00"]
[ggmlhexagon_load_cfg, 1786]: internal ggml_hexagon_version=1.80
[ggmlhexagon_load_cfg, 1787]: internal ggml_dsp_version=0.60
[ggmlhexagon_load_cfg, 1788]: external ggml_hexagon_version=1.00
[ggmlhexagon_load_cfg, 1790]: hwaccel_approach=2(HWACCEL_CDSP)
[ggmlhexagon_load_cfg, 1792]: hexagon_backend=2(HEXAGON_BACKEND_CDSP)
[ggmlhexagon_load_cfg, 1793]: runtime libpath=/data/data/com.termux/files/usr/lib/
[ggmlhexagon_load_cfg, 1794]: enable_perf=1
[ggmlhexagon_load_cfg, 1795]: enable_profiler=0
[ggmlhexagon_init_dsp, 5209]: init Hexagon cDSP with backend 2(HEXAGON_BACKEND_CDSP)
[ggmlhexagon_init_dsp, 5280]: using Hexagon domain 3(Hexagon-cDSP)
[ggmlhexagon_init_dsp, 5281]: unsignedpd_enabled 1
[ggmlhexagon_init_dsp, 5327]: error 0x80000406: failed to open domain 3(Hexagon-cDSP)
[ggmlhexagon_deinit_cdsp, 5172]: enter ggmlhexagon_deinit_cdsp
[ggmlhexagon_deinit_cdsp, 5185]: leave ggmlhexagon_deinit_cdsp
[ggml_backend_hexagon_reg, 6376]: init hexagon dsp failure
/root/.builder/source/bin/llama-cpp-hexagon-branch-pr_to_upstream/ggml/src/ggml-hexagon/ggml-hexagon.cpp:6378: GGML_ASSERT(0 == result) failed
```

---
An env var might be a good idea, although it seems equivalent to adding a new runtime configuration item in scripts/ggml-hexagon.cfg. There are many runtime configuration items at the moment, so a uniform approach might be better for developers and users.
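A hypothetical sketch of the env-var idea (the variable name and helper are made up for illustration; the current code hard-codes the default path):

```cpp
#include <cstdlib>
#include <string>

// Resolve the runtime lib path: an environment override if present,
// otherwise the compiled-in default (QNN_DEFAULT_LIB_SEARCH_PATH today).
static std::string ggmlhexagon_get_runtimelib_path() {
    if (const char * p = std::getenv("GGMLHEXAGON_LIB_PATH")) { // assumed name
        return std::string(p);
    }
    return "/data/local/tmp/"; // current hard-coded default
}
```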
---

…orks in a standard Android APP)
…accel_approach) in ggml-hexagon.h for further usage
* [ ] Low
* [x] Medium (complexity of code on the ARM-AP side is medium; complexity of code on the cDSP side (hexagon-kernels) is high)
* [ ] High
* [x] test-backend-ops and llama-cli through HWACCEL_QNN on Qualcomm Snapdragon 8Gen3 & 8 Elite equipped Android phones
* [x] test-backend-ops and llama-cli through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 & 8 Elite equipped Android phones
* [x] the major features in the ggml backend subsystem through HWACCEL_CDSP (the main approach in this PR) have been verified on Qualcomm Snapdragon 8Gen3 & 8 Elite equipped Android phones
PR Description
This PR is a continuation of my original PR #6869 from 04/2024, focused on the final mission:

Like other existing backends, this PR is the initial phase of the ggml-hexagon backend for the Qualcomm Hexagon NPU on Android phones. It is already a functional/practical MVP (Minimum Viable PR): it supports GGML_OP_ADD and GGML_OP_MUL_MAT, it passes test-backend-ops and llama-cli, and the performance of GGML_OP_ADD and GGML_OP_MUL_MAT with fp32 on the cDSP side is very positive.
The full and TL;DR description of this PR can be found at my forked llama.cpp project: jeffzhou2000#30.
The high-level data path, or so-called high-level architecture, of ggml-hexagon can be found at my forked llama.cpp project: high-level data path of ggml-hexagon.
Features
- Provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.
- Provide a very fast approach (HWACCEL_CDSP), closely analogous to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload some performance-sensitive ggml ops to the Hexagon cDSP directly.
- Make the Hexagon NPU performance of the HWACCEL_QNN and HWACCEL_CDSP approaches easy to compare: provide a computation visualization approach in this PR to help other developers and AI experts visualize the comparison between the cDSP approach and the QNN approach.
- Dynamic runtime parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added in this configuration file); see the example config after this list.

- Probe/detect Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSPs:


#v68 --- Snapdragon 888
#v69 --- Snapdragon 8 Gen1
#v73 --- Snapdragon 8 Gen2
#v75 --- Snapdragon 8 Gen3(verified)
#v79 --- Snapdragon 8 Elite(aka 8 Gen4) (verified)
- Provide a customized tiny ggmldsp, ported from the original ggml, which runs well on the Hexagon cDSP side. This feature will be very helpful for domain experts or AI experts who want to do AI innovation directly on the cDSP side with Qualcomm's amazing lightweight, low-level (C/C++ and HVX assembly, able to operate the hardware directly) Hexagon SDK, rather than learning Qualcomm's highly-designed, heavyweight, high-level QNN SDK API on the ARM-AP side.
- Provide a big picture of the ggml-hexagon backend in this PR for further and other related dev activities in this great pure-tech community.
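As referenced in the ggml-hexagon.cfg feature above, here is a plausible example of the configuration file, reconstructed from the sections and keys visible in the Termux log earlier in this thread (values are the ones printed there; the comment syntax is an assumption, and the real file in scripts/ may differ):

```ini
[general]
hwaccel_approach       = 2        # 2 = HWACCEL_CDSP
hexagon_backend        = 2        # 2 = HEXAGON_BACKEND_CDSP
enable_perf            = 1
enable_q_mulmat        = 0
print_qnn_internal_log = 0
print_tensors_info     = 0
dump_op_info           = 0
version                = "1.00"

[qnn]
precision_mode  = "fp16"
enable_dlbc     = 1
vtcm_size_in_mb = 8
hvx_threads     = 4

[cdsp]
enable_rpc_dma_mempool = 0
enable_rpc_ion_mempool = 0
```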
How to build ggml‐hexagon source code for Android and verify ggml-hexagon backend on Snapdragon based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also be OK):
Utilize build-run-android.sh to download the Android NDK and Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm Developer Account and cannot be downloaded automatically by this script.
We will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
We can confirm that this backend works as expected from the log output of `adb logcat | grep ggml-hexagon`.
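For example, a minimal verification loop from the host machine (plain adb usage; the on-device path mirrors the default runtime lib path mentioned earlier in this thread):

```sh
# terminal 1: stream only this backend's log lines from the phone
adb logcat | grep ggml-hexagon

# terminal 2: run one of the verified test binaries on the phone
adb shell "cd /data/local/tmp && LD_LIBRARY_PATH=. ./test-backend-ops"
```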
Hexagon NPU Performance
The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.32.0.250228 and the Hexagon SDK is v6.2.0.1.
Case 1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP in real LLM inference
Case 2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP (small-matrix mulmat through test-backend-ops)
[updated on 04/09/2025, 09:19] I suddenly found that QNN-NPU's performance was significantly improved after I upgraded the QNN SDK to v2.33.0.250327.
The test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The QNN SDK is v2.33.0.250327 and the Hexagon SDK is v6.2.0.1.
The details and how to reproduce the above results can be found at my forked llama.cpp project: jeffzhou2000#28.
Big picture of ggml-hexagon backend
There are three tech approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU:
The tech details of "the special approach through QNN" can be found at my forked llama.cpp project: jeffzhou2000#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: jeffzhou2000#28.
Acknowledgement
Conclusion
After spending so much effort on the ggml-hexagon backend, I personally think:
[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you. I understand your time is valuable; could you help change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? Thanks so much!